ABSTRACT

Run time preprocessing plays a major role in many efficient algorithms in computer
science, as well as playing an important role in exploiting multiprocessor architectures. We 
give examples that elucidate the importance of run time preprocessing and show how these 
optimizations can be integrated into compilers. To support our arguments, we describe 
transformations implemented in prototype multiprocessor compilers and present benchmarks 
from the iPSC2/860, the CM-2, and the Encore Multimax/320. 
1 Introduction

1.1 Overview 

In many algorithms, data produced or input during a program’s initialization plays a large 
role in determining the nature of the subsequent computation. When the data structures 
that define a computation have been initialized, a preprocessing phase follows. Vital elements 
of the strategy used by the rest of the algorithm are determined by this preprocessing phase. 

To effectively exploit many multiprocessor architectures, we may also have to carry out 
run time preprocessing. This preprocessing will be referred to as runtime compilation. The
purpose of runtime compilation is not to determine which computations are to be performed 
but instead to determine how a multiprocessor machine will schedule the algorithm’s work, 
how to map the data structures and how data movement within the multiprocessor is to be 
scheduled. In this paper, we specifically address problems for which computational patterns 
can be predicted when values assigned to key data structures are known. These problems 
include computations on non-uniform meshes, sparse direct factorization which does not 
involve pivoting and sparse iterative linear solvers. 

Values obtained during program execution can affect the nature and degree of potential 
concurrency. Runtime compilation may be needed to identify and exploit concurrency. Complex
heterogeneous memory hierarchies characterize virtually all multiprocessor architectures
with more than a few dozen processors. Primary memory is divided among processors. To
obtain data from other portions of the primary memory of the multiprocessor, we typically
need to access a communications network. Program performance can be dramatically affected
by the scheduling of data movement among processors.

There has been much research carried out on methods for runtime parallelization as well 
as runtime workload and data partitioning. Most parallelization and problem partitioning 
methods explicitly or implicitly specify patterns of interprocessor communication. When 
patterns of computation are determined by data structures initialized during program ex- 
ecution, traditional compiler techniques cannot possibly carry out these partitioning and 
scheduling operations. Only recently have methods been developed that can integrate the 
kinds of runtime optimizations mentioned above into compilers and programming environ- 
ments. 

2 Algorithmic Execution Time Preprocessing 

In many efficient approaches to solving problems in computing, data produced or input dur- 
ing program execution plays a large role in determining computational patterns. Examples 
include:


• Most searching and sorting problems 

• Critical path analysis 

• Game tree and decision tree manipulations 

• Direct and iterative sparse linear system solvers 

Once an appropriate subset of the input (or generated) data is available, it is frequently 
worthwhile to perform some preprocessing. This preprocessing can take many forms, but 
results of the preprocessing determine vital elements of the strategy used by the remainder 
of the algorithm. A simple example of this is the method of interpolated binary search. 
The number of computations required for a simple binary search of a sorted list depends on 
the values of the elements in the list and on the value of the key. We can preprocess the 
sorted list and use the distribution of element values in the list to produce an interpolation 
function that is used to direct the search. It is frequently possible to amortize the cost of 
preprocessing. In the interpolated binary search example, once preprocessing is carried out, 
we can use the interpolation function to search for a sequence of different keys.
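
The preprocessing idea behind interpolated search can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the specific method the text has in mind: the names build_interpolation, guess_position and search are hypothetical, keys are assumed numeric, and the interpolation function is a piecewise-linear fit through a handful of sampled positions. The table is built once and then reused for every key.

```python
from bisect import bisect_left

def build_interpolation(sorted_keys, samples=8):
    """Preprocessing: sample (value, position) pairs at evenly spaced positions."""
    n = len(sorted_keys)
    if n == 0:
        return []
    positions = sorted({round(k * (n - 1) / samples) for k in range(samples + 1)})
    return [(sorted_keys[p], p) for p in positions]

def guess_position(table, key):
    """Piecewise-linear estimate of the position at which key should appear."""
    values = [v for v, _ in table]
    j = bisect_left(values, key)
    if j == 0:
        return table[0][1]
    if j == len(table):
        return table[-1][1]
    (v0, p0), (v1, p1) = table[j - 1], table[j]
    if v1 == v0:
        return p0
    return p0 + round((key - v0) * (p1 - p0) / (v1 - v0))

def search(sorted_keys, table, key):
    """Return an index holding key, or -1 if key is absent."""
    n = len(sorted_keys)
    if n == 0:
        return -1
    g = min(max(guess_position(table, key), 0), n - 1)
    lo = hi = g
    step = 1
    while lo > 0 and sorted_keys[lo] > key:       # widen the bracket to the left
        lo = max(0, lo - step)
        step *= 2
    step = 1
    while hi < n - 1 and sorted_keys[hi] < key:   # widen the bracket to the right
        hi = min(n - 1, hi + step)
        step *= 2
    i = bisect_left(sorted_keys, key, lo, hi + 1)  # ordinary binary search in the bracket
    return i if i < n and sorted_keys[i] == key else -1
```

When the sampled table tracks the key distribution well, the initial guess lands near the target and the bracket stays small; the cost of build_interpolation is paid once and shared by all later calls to search.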

Some other examples of well known algorithms that carry out preprocessing to determine 
vital elements of the strategy used by the remainder of the algorithm are: 

a. Creation of indices to speed database retrieval where indices are created to allow the 
use of efficient search methods on many different database keys [30]. 

b. Generation of threaded binary search trees where extra links are added to a binary 
tree to speed tree traversal [16]. 

c. Matrix reordering and symbolic factorization used in sparse direct linear equation 
solvers. In such problems, the number and pattern of computations in a sparse matrix 
factorization is determined by the order in which steps in the factorization are carried 
out. In many cases it is possible to use the non-zero structure of a matrix to predeter- 
mine the order in which computations will be carried out and to allocate the memory 
needed to store the resulting factored matrix [12]. 

In each of these examples, the results of a single preprocessing computation can be used to 
solve any member of a class of structurally similar problems. In the database example, the 
creation of an index can be followed by an arbitrary number of queries. Once a threaded 
binary search tree is generated, the resulting data structure can be used in an arbitrary number
of tree traversals. A symbolic matrix factorization can be used to speed the factorization
of any matrix with a given pattern of non-zero entries. 

Runtime compilation techniques attempt to discover how to maximize the performance 
of algorithms on multiprocessors. Since these methods are particularly useful in algorithms 
whose computational patterns depend on values assigned to data structures during program 
execution, a significant preprocessing cost is frequently involved. In runtime compilation, we 
are also often able to amortize costs of preprocessing among a number of structurally similar 
computational phases. 

3 Run-Time Parallelization 

Run-time parallelization is perhaps the most obvious form of multiprocessor runtime compila- 
tion. Parallelization carried out during compilation is necessarily conservative. If a compiler 
cannot figure out how to generate a correct parallelizing loop transformation, loop iterations 
have to be performed sequentially. Many loop nests defy compile-time parallelization be- 
cause dependency patterns are determined by variables or arrays initialized during program 
execution. One way of carrying out runtime parallelization is to analyze the inter-iteration 
dependency pattern in a loop nest to identify wavefronts of concurrently executable loop iter- 
ations. Using a form of run time preprocessing, we transform a loop nest with inter-iteration 
dependencies into a sequence of parallel loops. Execution time preprocessing is frequently 
used to parallelize sparse numerical algorithms, such as those arising in sparse direct and 
iterative linear solvers [2], [4], [24], [11], [1]. 

Typically, programmers need to explicitly code the procedures that carry out the nec- 
essary run time preprocessing. It is possible to produce a runtime parallelization program 
transformation that generates code designed to perform run-time loop parallelization [28]. 
The compiler transforms a loop into two separate code segments. The first code segment, 
the inspector , finds sets of independent loop iterations while the second code segment, the 
executor , carries out the scheduled work. Runtime parallelization transformations have been 
implemented in a prototype compiler targeted at shared memory machines [29]. Runtime 
compilation only handles a subset of the possible types of runtime parallelization. Our trans- 
formations only apply to loop nests in which inter-iteration dependencies do not depend on 
the results of computations carried out within the loop nest. There are a number of al- 
gorithms that merge the process of identifying and performing concurrent work. It seems 
likely to us that it will be possible to produce compiler transformations that generate hybrid 
inspector/executors for more fully dynamic algorithms, but we will not address this issue
further in this paper.

do i=1,N
  y(i) = a(i)*y(ia(i)) + b(i)*y(ib(i))
end do

Figure 1: Sequential Code to be Parallelized

To clarify the scheme, we now present a simple example. A simple sequential program 
is presented in Figure 1. Note that right hand side references to array y use a level of 
indirection. The inspector used to perform runtime parallelization (Figure 2) is simply a 
topological sort. This sort can be generated from the parse tree produced by the loop in 
Figure 1. The inspector in Figure 2 is sequential but can be parallelized using the principles 
to be described later in this section. Once the wavefront corresponding to each index is 
known, we can sort the indices in order of increasing wavefront number to produce the array 
schedule. The inspector also initializes a pointer array count. Array count contains the 
address in schedule of the beginning of each wavefront. Loop iterations corresponding to 
wavefront i are found in schedule between count(i) and count(i+1) - 1.

The executor in Figure 3 is a sequence of parallel do loops that run over consecutive 
wavefronts obtained by the inspector from the sequential code in Figure 1. Note that to 
obtain the correct solution in the executor we need to maintain two copies of the array y 
found in the sequential code. In Figure 3 we call these copies y and ynew.

In evaluating the usefulness of run-time parallelization, the cost of the preprocessing must 
be taken into account. In [29] we present timings obtained from the run-time parallelization 
transformation applied to sparse lower triangular solves. On an 18 processor Encore Mul- 
timax/320, a single processor required 241 milliseconds to solve a lower triangular system 
obtained from an incomplete factorization of one of the Boeing Harwell test matrices. On 
16 processors of the Multimax, the inspector required 100 milliseconds and the executor re- 
quired 23 milliseconds. In many situations, we can amortize the cost of an inspector because 
we need to repeatedly carry out a given pattern of computations. For instance, in iterative 
linear systems solvers we may need to repeatedly solve the same sparse triangular systems 
with different right hand sides. 
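
The inspector and executor of Figures 2 and 3 can be sketched in Python as follows. This is an illustrative transcription under stated assumptions, not the code generated by the prototype compiler: indices are 0-based, the function names are ours, and the iterations of each wavefront are executed one after another where a real executor would run them as a parallel loop.

```python
def sequential(y, a, b, ia, ib):
    """Reference loop of Figure 1, with 0-based indices."""
    y = list(y)
    for i in range(len(y)):
        y[i] = a[i] * y[ia[i]] + b[i] * y[ib[i]]
    return y

def inspector(ia, ib):
    """Assign a wavefront number to each iteration and sort iterations by it."""
    n = len(ia)
    wf = [0] * n
    for i in range(n):
        # wf is still 0 for indices >= i, so only dependences on earlier
        # iterations raise the wavefront number of iteration i.
        wf[i] = max(wf[i], wf[ia[i]], wf[ib[i]]) + 1
    schedule = sorted(range(n), key=lambda i: wf[i])
    nphases = max(wf, default=0)
    count = [0] * (nphases + 1)
    for w in wf:
        count[w] += 1                  # size of wavefront w
    for k in range(1, nphases + 1):
        count[k] += count[k - 1]       # count[k] = end of wavefront k in schedule
    return wf, schedule, count

def executor(y, a, b, ia, ib, schedule, count):
    """Run the wavefronts in order; iterations within one wavefront are independent."""
    ynew = list(y)
    for phase in range(len(count) - 1):
        for ii in schedule[count[phase]:count[phase + 1]]:
            t1 = ynew[ia[ii]] if ia[ii] < ii else y[ia[ii]]
            t2 = ynew[ib[ii]] if ib[ii] < ii else y[ib[ii]]
            ynew[ii] = a[ii] * t1 + b[ii] * t2
    return ynew
```

Because the inspector depends only on ia and ib, its output can be reused by executor for any number of solves with different values in y, a and b.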

wf(1:N) = 0
do i=1,N
  wf(i) = max(wf(i),wf(ia(i)),wf(ib(i))) + 1
end do

Use wf() to produce schedule(), a list of indices in order of increasing wavefront number

Figure 2: Parallelizing Inspector


do phase = 1, np
  parallel do i=count(phase),count(phase+1)-1
    ii = schedule(i)

    if (ia(ii).lt.ii) then
      tmp1 = ynew(ia(ii))
    else
      tmp1 = y(ia(ii))
    endif

    if (ib(ii).lt.ii) then
      tmp2 = ynew(ib(ii))
    else
      tmp2 = y(ib(ii))
    endif

    ynew(ii) = a(ii)*tmp1 + b(ii)*tmp2
  end parallel do
end do

y(1:n) = ynew(1:n)

Figure 3: Parallelizing Executor

A variety of tradeoffs can be made between the costs and benefits of preprocessing. We
can dispense with reordering loop iterations into concurrent wavefronts and still be able to
exploit parallelism to a degree by using a preprocessed doacross transformation [27]. In a 
doacross construct [9], loop iterations are partitioned between processors in a striped fashion
and synchronization calls are introduced so that computations from some loop iterations can 
be overlapped. Doacross loops typically make use of a-priori knowledge of inter-iteration 
dependencies to carry out needed inter-iteration synchronizations. It is possible to carry 
out a relatively small amount of run time preprocessing and postprocessing that eliminates 
the need for a-priori knowledge of dependencies. On machines with snooping caches (such 
as the Multimax/320), it is efficient to synchronize using shared arrays. The following is a 
sketch of some of the transformations involved in generating preprocessed doacross loops; a
much more detailed description may be found in [27]. A shared array ready is initialized
to NOTDONE. When a left hand side array element i is calculated, ready(i) is set to DONE.
Processors needing to use an updated value of array element i busy wait on ready(i) until
ready(i) is set to DONE. In preprocessed doacross loops, we need to maintain two copies of
shared arrays that appear on the left hand side of expressions during the computation. After 
the loop is completed, the two shared array copies need to be reconciled. 
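
A rough Python sketch of the preprocessed doacross scheme, applied to the loop of Figure 1, is given below. It is only an illustration under our own assumptions: threads stand in for processors, a threading.Event per element replaces busy waiting on the shared ready array, indices are 0-based, and the function name doacross is hypothetical. The transformations described in [27] are more involved.

```python
import threading

def doacross(y, a, b, ia, ib, nthreads=4):
    """Striped doacross execution of y(i) = a(i)*y(ia(i)) + b(i)*y(ib(i))."""
    n = len(y)
    ynew = [None] * n                                # second copy of the array
    ready = [threading.Event() for _ in range(n)]    # unset plays the role of NOTDONE

    def operand(j, i):
        if j < i:
            ready[j].wait()     # block until iteration j has produced its value
            return ynew[j]
        return y[j]             # not yet overwritten in the sequential order

    def worker(p):
        # Thread p handles iterations p, p + nthreads, p + 2*nthreads, ...
        for i in range(p, n, nthreads):
            ynew[i] = a[i] * operand(ia[i], i) + b[i] * operand(ib[i], i)
            ready[i].set()      # plays the role of DONE

    threads = [threading.Thread(target=worker, args=(p,)) for p in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ynew                 # reconciliation: the caller copies ynew back into y
```

Each thread visits its own iterations in increasing order and only ever waits on lower-numbered iterations, so the lowest unfinished iteration can always proceed and the loop cannot deadlock.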

The run time initialization and postprocessing in the preprocessed doacross loop are rel- 
atively inexpensive compared to the preprocessing costs incurred by a parallelizing inspector 
(e.g. Figure 2). For the above cited lower triangular solve involving the incompletely factored
Boeing Harwell test matrix, the preprocessed doacross loop requires 45 milliseconds. This 
can be compared to the 23 milliseconds required to carry out the runtime parallelized solve 
and the 100 millisecond preprocessing time of the inspector. 
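
The tradeoff among these timings can be made concrete with a small calculation. The sketch below assumes that the 100 millisecond inspector cost is paid once, that the 23 millisecond executor and the 45 millisecond preprocessed doacross loop are each paid on every solve, and that the pattern of the triangular system does not change between solves; the function names are ours.

```python
import math

SEQUENTIAL_MS = 241   # single processor solve
INSPECTOR_MS = 100    # one-time preprocessing on 16 processors
EXECUTOR_MS = 23      # runtime parallelized solve on 16 processors
DOACROSS_MS = 45      # preprocessed doacross solve

def inspector_executor_ms(k):
    """Total time for k solves when the inspector runs once."""
    return INSPECTOR_MS + EXECUTOR_MS * k

def doacross_ms(k):
    """Total time for k solves with the preprocessed doacross loop."""
    return DOACROSS_MS * k

def break_even_solves():
    """Smallest k for which inspector plus executor beats doacross."""
    return math.floor(INSPECTOR_MS / (DOACROSS_MS - EXECUTOR_MS)) + 1
```

Under these assumptions a single solve costs 123 milliseconds with the inspector and executor against 241 milliseconds sequentially, the doacross loop is cheaper for up to four solves (180 versus 192 milliseconds at k = 4), and the inspector and executor win from the fifth solve onward (215 versus 225 milliseconds).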

Runtime parallelization can be carried out on a variety of architectures. In this paper, 
we discuss runtime parallelization only in the context of shared memory architectures; a 
discussion of runtime parallelization for distributed memory machines is found in [26]. 

4 Runtime Compilation for Distributed Memory Ma- 
chines 

4.1 Distributed Memory Inspectors and Executors 

In distributed memory machines, large data arrays need to be partitioned between local 
memories of processors. These partitioned data arrays are called distributed arrays. We 
follow the usual practice of assigning long term storage of distributed array data to specific 
memory locations in the distributed machine. A processor that needs to read an array 
element must fetch a copy of that element from the memory of the processor in which that 
array element is stored. Alternately, a processor may need to store a value in an off-processor 
distributed array element. Local copies of off-processor distributed array elements are stored
in hash tables called hashed caches. Run-time procedures carry out the movement of data
between processors and manage the above mentioned hash tables.

Each processor P:

  - preprocesses its own loop iterations

  - records off-processor fetches and stores in hashed cache

  - finds send/receive calls required for data exchange

      (i) P generates list of all off-processor data to be fetched

      (ii) Other processors tell P which data to send

      (iii) Send/receive pairs generated and stored

Figure 4: Inspector For Parallel Loop on Distributed Memory Multiprocessor

In distributed memory MIMD architectures, there is typically a non-trivial communi- 
cations latency or startup cost [7]. For efficiency reasons, information to be transmitted 
should be collected into relatively large messages. The cost of fetching array elements can 
be reduced by precomputing what data each processor needs to send and to receive. 

In Figure 4, we outline the preprocessing we performed to implement a parallel loop on a 
distributed machine. The distribution of parallel loop indices to processors determines where 
computations are to be performed. We assume that all needed distributed arrays have been 
defined and initialized and that loop iterations have been partitioned between processors. 
Using the hashed cache to record off-processor fetches and stores allows us to recognize when 
more than one reference is being made to the same off-processor distributed array element, 
so that only one copy of that element need be fetched or stored. 

During our inspector phase, we carry out a set of interprocessor communications that 
allows us to anticipate exactly which send and receive communication calls each processor 
must execute so that all interprocessor data transmission is correctly carried out. By contrast, 
if individual fetches and stores were to be carried out during the actual computation, things 
would be much more awkward. For example, in such a case processor A might obtain the 
contents of a distributed array element which is not on A by sending a message to processor 
B associated with the array element. Processor B would be programmed to anticipate a 
request of this type, to satisfy the request and to return a responding message containing 
the contents of the specified array element. 
• Before loop or code segment


(i) Data to be sent off-processor read from distributed arrays 

(ii) Send/receive calls transport off-processor data 

(iii) Data written into hashed cache 

• Computation carried out 

- off-processor reads/writes go to hashed cache 


• At end of loop or code segment 

(i) Data to be stored off-processor read from hashed cache 

(ii) Send/receive calls transport off-processor data 

(iii) Data written back into distributed arrays for longer term storage 

Figure 5: Executor For Parallel Loop on Distributed Memory Multiprocessor 

Once preprocessing is completed, we are in a position to carry out the necessary communication
and computation; Figure 5 outlines the steps involved. The initial data exchange
phase follows the plan established by the inspector. During preprocessing, each processor
finds out which distributed array elements need to be transmitted. When a processor
obtains copies of off-processor distributed array elements, the copies are written into the
processor's hashed cache. Once the communication phase is over, each processor carries out
its computation. Each processor uses locally stored portions of distributed arrays along with 
off-processor distributed array elements stored in the hashed cache. When the computational 
phase is finished, distributed array elements to be stored off-processor are obtained from the 
hashed cache and sent to the appropriate off-processor locations. 
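
The fetch half of the inspector and executor outlined in Figures 4 and 5 can be simulated in a single Python process. The sketch below is illustrative only: it assumes a block distribution in which processor p owns global indices p*block through (p+1)*block - 1, it models each hashed cache as a Python dictionary, it represents message passing by direct list copies, and all names are hypothetical rather than taken from the actual run-time procedures. Off-processor stores at the end of the loop are omitted.

```python
def owner(g, block):
    """Processor that stores global index g under the assumed block distribution."""
    return g // block

def inspector(refs_by_proc, block):
    """Build per-processor fetch lists and the matching send lists."""
    nprocs = len(refs_by_proc)
    fetch = [dict() for _ in range(nprocs)]   # fetch[p][q]: globals p needs from q
    for p, refs in enumerate(refs_by_proc):
        # A set removes repeated references, as the hashed cache does.
        needed = {g for g in refs if owner(g, block) != p}
        for g in sorted(needed):
            fetch[p].setdefault(owner(g, block), []).append(g)
    send = [dict() for _ in range(nprocs)]    # send[q][p]: globals q must send to p
    for p in range(nprocs):
        for q, wanted in fetch[p].items():
            send[q][p] = wanted               # "other processors tell P which data to send"
    return fetch, send

def exchange(x_local, send, block):
    """Initial data exchange: one aggregated message per (sender, receiver) pair."""
    nprocs = len(x_local)
    caches = [dict() for _ in range(nprocs)]
    for q in range(nprocs):
        for p, wanted in send[q].items():
            message = [x_local[q][g - q * block] for g in wanted]
            for g, value in zip(wanted, message):
                caches[p][g] = value
    return caches

def read(p, g, x_local, caches, block):
    """Array read during the computation phase on processor p."""
    if owner(g, block) == p:
        return x_local[p][g - p * block]
    return caches[p][g]
```

The inspector runs once per access pattern; exchange and read are then repeated for every sweep that uses the same pattern.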

There are many situations in which simple, easily specified distributed array partitions 
are inappropriate. For instance, when we compute using an unstructured mesh, we attempt
to partition the problem so that each processor performs approximately the same amount of 
work and so that the communications overhead is minimized. Typically, it is not possible to 
express the resulting array partitions in a simple way. If we allow an arbitrary assignment of
distributed array elements to processors, the data structure used to describe the partitioning
will have the same number of elements as the distributed array.

Each processor P:

  - preprocesses its own loop iterations

  - records off-processor fetches and stores in hashed cache

  - consults distributed translation table to

      * find location in distributed memory for each off-processor fetch or store

  - finds send/receive calls required for data exchange

      (i) P generates list of all off-processor data to be fetched

      (ii) Other processors tell P which data to send

      (iii) Send/receive pairs generated and stored

Figure 6: Inspector For Parallel Loop Using Irregular Distributed Array Mapping

In order to access an array element, we need to know where the element is stored in 
the memory of the distributed machine. We use a distributed translation table, defined by
a partitioning algorithm, to describe the mapping. When a distributed translation table
is used to describe array mappings, inspectors must be modified so that they access the 
distributed table. Using an irregular array mapping does not alter the form of the executor. 
The modifications to be made to an inspector are outlined in Figure 6. 
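
One possible organization of a distributed translation table is sketched below in Python. The layout is an assumption made for illustration and is not specified by the text: the table is itself split into equal blocks across processors, each entry records the owning processor and a local offset, and local offsets are assigned in order of increasing global index. The names build_translation_table and lookup are hypothetical.

```python
def build_translation_table(map_array, nprocs):
    """map_array[g] is the processor that owns global element g."""
    next_free = [0] * nprocs
    entries = []
    for proc in map_array:
        entries.append((proc, next_free[proc]))   # (owner, local offset)
        next_free[proc] += 1
    block = -(-len(entries) // nprocs)            # ceiling division
    # pages[p] is the slice of the table held in the memory of processor p.
    pages = [entries[p * block:(p + 1) * block] for p in range(nprocs)]
    return pages, block

def lookup(pages, block, g):
    """Return (table holder, owner, local offset) for global index g."""
    holder = g // block
    proc, offset = pages[holder][g - holder * block]
    return holder, proc, offset
```

The first component of the result of lookup is the processor whose memory must be consulted to find out where element g lives; this extra consultation is the reason an inspector that uses an irregular mapping needs additional communication.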

4.2 Languages and Tools for Irregular Problems 

Programs designed to carry out sparse direct and iterative methods also typically require 
many of the optimizations described in Section 4.1. Some examples of such programs are 
described in [3], [18], [15], [4]. Williams [34] describes a programming environment (DIME) 
for calculations with unstructured triangular meshes using distributed memory machines. In 
[34], collections of distributed array accesses are translated into an efficient set of inter-node
messages. The DIME programming environment embodies many of the principles discussed 
in Section 4.1. The optimizations discussed in the last section can be incorporated into 
distributed compilers. Runtime compilation for distributed machines was proposed in [25]; 
this description was in the context of the Crystal language. Distributed memory runtime 
compilation was expanded upon in [20], which outlines the principles behind the PARTI
project. A more detailed description of the concepts behind distributed memory runtime
compilation is found in [26] and [21]. The idea of splitting a loop into an inspector and
executor and integrating this into a compiler was also developed independently as part of
the KALI project [17]. Other compiler projects have also proposed run time resolution of
communications on distributed machines [8], [22], [23]. These compilers do not carry out
run time optimizations of the sort described here.

We have designed a set of procedures or primitives that do the work needed to implement 
inspectors and executors. We have also designed and implemented a model compiler that 
recognizes a subset of Fortran (ARF - ARguably Fortran) and generates inspector and ex- 
ecutor loops with embedded primitives. Distributed arrays can be declared in ARF source. 
These distributed arrays can either be partitioned between processors in a uniform manner
(e.g. equal sized blocks of contiguous array elements assigned to each processor) or
be partitioned in an irregular manner. When an array is to be partitioned
in an irregular fashion, mapping information is specified in an integer array. This
integer array is typically produced by a partitioning procedure. Element i of the integer ar- 
ray describes the processor to which element i of the distributed array is to be mapped. For 
example, consider the ARF declaration, distributed irregular using map real y(4). 
This declaration denotes a four element real array y that is to be distributed according to 
integer array map. 

Embedded primitives include communications procedures designed to support irregular 
patterns of distributed array access. Other primitives that involve interprocessor commu- 
nication initialize distributed translation tables or access distributed translation tables to 
find the location of irregularly mapped distributed array data. Primitives also support the 
maintenance of hashed caches. (Recall from Section 4.1 that hashed caches store copies of 
off-processor distributed array data.) There are also PARTI primitives that perform accu- 
mulations to off-processor distributed array elements. 

In Figure 7 we present a simple example of an ARF program. The procedure to be 
presented is a block sparse matrix vector multiply, obtained from an iterative solver produced 
for a program designed to calculate fluid flow for geometries defined by an unstructured mesh 
[31]. The ARF compiler uses information in integer array row to make calls to primitives 
that initialize the distributed translation tables. These distributed translation tables are 
used to describe the mapping of x, y, cols and ncols (statements S1 and S2). The
current version of ARF distributes only the last declared dimension in a multidimensional 
array, although the PARTI primitives do support a broader class of array mappings [6]. 
As of the time of writing, the ARF compiler does not include syntax that specifies where 
computational work is to be performed. Partitioning procedures specify where work is to be
carried out, but the interface between partitioning procedures and ARF has not yet been 
automated. 

In Figure 7, array x is indexed by m and cols(j,i). If we were to carry out the work
in this loop in a naive manner, we would have to fetch each individual distributed array
element x(m,cols(j,i)) (statement S4) from its assigned processor. Since the processor
assignments of elements of x are stored in a distributed translation table, we would also need
to access the memory of the processor that keeps track of where x(m,cols(j,i)) is stored.

In Table 1, we present the execution times on 32 processor and 64 processor Intel 
iPSC/860 machines, obtained from the block matrix vector multiply kernel as well as exe- 
cution times from another, more complex kernel that arose in an unstructured code. This 
kernel, to be referred to here as fluxroe, computes convective fluxes using a method based on
Roe’s approximate Riemann solver [32], [33]; the kernel is discussed in some detail in [6]. 
Both the block matrix vector multiply and the fluxroe kernel arise from iterative algorithms. 
In these tests, fluxroe was translated into ARF and compiled. In these experiments, we used 
two different unstructured meshes: 

(i) A 21,672 element mesh generated to carry out an aerodynamic simulation involving a 
multielement airfoil in a landing configuration [19] 

(ii) A 37,741 element mesh generated to simulate a 4.2% circular arc airfoil in a channel
[14]. 

In all the cases presented below, each unstructured mesh was partitioned by recursive 
orthogonal dissection [13]. 

In Table 1 we present:

inspector time - time required to carry out the inspector preprocessing phase 

computation time - the time required to perform computations in the iterative portion 
of the program 

communication time - the time required to exchange messages within the iterative 
portion of the program. 

The inspector time includes the time required to set up the needed distributed translation 
table as well as the time required to access the distributed translation table when carrying 
out the preprocessing in the inspector. In these experiments, the ratio of the time required to 
carry out the inspector to the computation time required for a single iteration ranged from a 
factor of 3 to a factor of 5. Most of the preprocessing time goes to setting up and using the 
distributed translation table. For instance, consider the block matrix vector multiply on 64
processors using the 21,672 element mesh. The total preprocessing cost was 122 milliseconds, 
of which 111 milliseconds went to translation table related work. 

We can define parallel efficiency for a given number of processors P as the sequential 
time divided by the product of the execution time on P processors times P. In Table 1 we 
depict under the heading of single sweep efficiency, the parallel efficiencies we would obtain 
were we required to preprocess the kernel each time we carried out calculations. In reality, 
preprocessing time can be amortized over multiple mesh sweeps. If we neglect the time 
required to preprocess the problem in computing parallel efficiencies, we obtain the second 
set of parallel efficiency measurements, the amortized efficiency presented in Table 1. The 
amortized efficiencies for 64 processors ranged from 0.48 to 0.59, while the single sweep 
efficiencies ranged from 0.10 to 0.17. 
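
The two efficiency measures can be written out explicitly. The sketch below assumes that the execution time on P processors is the sum of the computation and communication times, plus the inspector time in the single sweep case; the text does not spell out this decomposition, and the function names are ours.

```python
def single_sweep_efficiency(t_seq, nprocs, t_insp, t_comp, t_comm):
    """Efficiency when the inspector is rerun for every sweep."""
    return t_seq / (nprocs * (t_insp + t_comp + t_comm))

def amortized_efficiency(t_seq, nprocs, t_comp, t_comm):
    """Efficiency when the preprocessing time is neglected."""
    return t_seq / (nprocs * (t_comp + t_comm))

def single_over_amortized(t_insp, t_comp, t_comm):
    """Ratio of the two measures; the sequential time and P cancel."""
    return (t_comp + t_comm) / (t_insp + t_comp + t_comm)
```

Under this assumption the ratio of the two measures depends only on the three times listed in Table 1. For the 64 processor Fluxroe row on the 21,672 element mesh (inspector 172 ms, computation 41 ms, communication 19 ms), the ratio is 60/232, or about 0.26; applied to the listed amortized efficiency of 0.48, it gives roughly 0.12, the listed single sweep efficiency.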

In the experiments depicted in Table 1, the time spent computing is at least a factor of 2 
greater than the communication time. The amortized efficiencies are, however, impacted by 
the fact that the computations in the parallelized codes are carried out less efficiently than 
those in the sequential program. The parallel code spends time accessing the hashed cache. 
It also needs to perform more indirections than the sequential program. 


Table 1: Performance on different numbers of processors

nprocs   inspector     comp          comm       single sweep   amortized
         time(ms)      time(ms)      time(ms)   efficiency     efficiency

Block Matrix Vector Multiply - 21,672 element mesh
  32     [illegible]   49            9          [illegible]    0.55
  64     [illegible]   25            9          [illegible]    0.48

Block Matrix Vector Multiply - 37,741 element mesh
  32     [illegible]   85            10         [illegible]    0.59
  64     [illegible]   42            9          [illegible]    0.54

Fluxroe - 21,672 element mesh
  8      231           [illegible]   24         0.40           [illegible]
  16     162           [illegible]   21         0.34           [illegible]
  32     135           80            22         0.19           0.57
  64     172           41            19         0.12           0.48

Fluxroe - 37,741 element mesh
  8      393           534           23         0.41           [illegible]
  16     249           269           18         0.36           [illegible]
  32     191           156           23         0.28           [illegible]
  64     203           69            14         0.17           [illegible]
S1  distributed irregular using row real x(4,n), y(4,n)
S2  distributed irregular using row integer cols(9,n), ncols(n)

    ... initialization of local variables ...

    doall i=1,n
      do j=1,ncols(i)
S3      do k=1,4
          sum = 0
          do m = 1,4
S4          sum = sum + f(i,m,k,i)*x(m,cols(j,i))
          enddo
          y(k,i) = y(k,i) + sum
        enddo
      enddo
    enddo

Figure 7: ARF Kernel From Unstructured CFD Code
4.3 Future Optimizations

Most of the optimizations described in this section are motivated either directly or indirectly 
by the high communication latencies typically found in distributed memory computers. Be- 
cause we can anticipate all of the interprocessor communications that will be needed in 
carrying out a loop, we have the information we need to schedule interprocessor communi- 
cations to reduce overheads due to contention. As we shall see in Section 5, scheduling of 
interprocessor communication has already been shown to be an important optimization for 
some SIMD architectures. We expect this to also turn out to be a fruitful optimization for 
distributed memory MIMD computers. 

Computations can be characterized by patterns of data dependency. Procedures that par- 
tition data structures and computational work take these dependency patterns into account. 
It is possible to design program transformations that generate procedures which output a 
record of the dependency patterns in a loop nest in a standard representation [20]. Standardized
partitioning programs that use these data structures can then be employed.

5 Runtime Compilation in SIMD Machines - the Com- 
munications Compiler 

Irregular problems can cause serious performance degradation on the CM-2 [5]. It turns 
out that this performance degradation can be ameliorated by a form of runtime compila- 
tion. Denning Dahl has developed a set of software facilities for the Connection Machine 
(CM-2) that are designed to handle applications that exhibit fixed irregular patterns of 
communication [10]. One procedure, the communications compiler, schedules interprocessor
communications. The other procedure, a mapping facility, maps graphs generated from a
communication pattern onto the CM-2. In this paper, we will focus our attention on the 
communications compiler. 

The communications compiler decomposes an irregular communications pattern into a 
sequence of simple, inexpensive data transfers. These data transfers make use of the hy- 
percube communication network in the CM-2. In the CM-2, all links of the hypercube 
can simultaneously carry bidirectional information. The communications compiler attempts
to reduce the time required for communication by the judicious scheduling of messages. At
present, the communications compiler is accessed by procedure calls. Lists of destination
addresses are passed to the communications compiler’s preprocessor procedures. Once the
preprocessing is completed, a data delivery function carries out the scheduled communications.
Transformations analogous to those discussed in Section 4.2 could be used to embed
these communications compiler primitives into programs, and hence to generate inspectors
and executors.
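
The split between a preprocessing call and a repeated data delivery call can be illustrated with a small sketch. This is not the communications compiler itself: the function names `compile_schedule` and `deliver` are hypothetical, and a simple greedy packing stands in for the actual scheduling algorithm. It also ignores the hypercube links and only ensures that no processor sends or receives more than one message per round.

```python
from collections import defaultdict


def compile_schedule(requests):
    """Inspector (hypothetical): pack (src, dst) messages into rounds.

    Within a round, no processor sends twice and no processor receives
    twice. Greedy first-fit packing is an assumption made for this sketch.
    """
    rounds, busy_src, busy_dst = [], [], []
    for src, dst in requests:
        for r in range(len(rounds)):
            if src not in busy_src[r] and dst not in busy_dst[r]:
                break
        else:
            # No existing round can take this message: open a new one.
            rounds.append([])
            busy_src.append(set())
            busy_dst.append(set())
            r = len(rounds) - 1
        rounds[r].append((src, dst))
        busy_src[r].add(src)
        busy_dst[r].add(dst)
    return rounds


def deliver(schedule, values):
    """Executor (hypothetical): replay a precomputed schedule.

    values[p] is the datum held by processor p. Returns a dict mapping
    each destination to the list of values it received.
    """
    inbox = defaultdict(list)
    for rnd in schedule:
        for src, dst in rnd:
            inbox[dst].append(values[src])
    return dict(inbox)
```

The schedule is built once and `deliver` is then called on every sweep, which is the inspector and executor pattern referred to above.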

We present a set of benchmarks that quantifies the performance effects of the communications
compiler. A synthetic workload was defined in the following way. A square mesh
in which each point was linked to four nearest neighbors was incrementally distorted. Random
edges were introduced subject to the constraint that in the new mesh, each point still
required information from four other mesh points.

The following assumptions are inherent in our workload generator:

(i) The problem domain consists of a 64 by 128 mesh of points which are numbered using 
their row major or natural ordering; 

(ii) Each point is initially connected to its four nearest neighbors;

(iii) Each link produced in the above step is examined; with probability q, the link is replaced
by a link to a randomly chosen point.
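
The three steps above can be sketched as a short generator. This is an illustrative reconstruction, not the original benchmark code: the name `make_mesh`, the fixed seed, and the periodic (torus) wraparound at the mesh boundary are assumptions. The sketch also lets a replacement link point to any mesh point, including the point itself or an existing neighbor, since the text places no restriction on the random choice.

```python
import random

ROWS, COLS = 64, 128  # mesh dimensions from step (i)


def make_mesh(q, rows=ROWS, cols=COLS, seed=0):
    """Return links[p] = the four points that point p reads from.

    Points are numbered in row major order. Wraparound at the edges
    is an assumption made so that every point has four neighbors.
    """
    rng = random.Random(seed)
    n = rows * cols
    links = []
    for p in range(n):
        r, c = divmod(p, cols)
        # Step (ii): the four nearest neighbors (north, south, west, east).
        nbrs = [
            ((r - 1) % rows) * cols + c,
            ((r + 1) % rows) * cols + c,
            r * cols + (c - 1) % cols,
            r * cols + (c + 1) % cols,
        ]
        # Step (iii): with probability q, redirect a link to a random point.
        links.append(
            [rng.randrange(n) if rng.random() < q else t for t in nbrs]
        )
    return links
```

Because each link is replaced rather than added, every point still requires data from exactly four points for any value of q.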

An 8192 processor Connection Machine-2 was configured as a 64 by 128 torus. The 
mesh was mapped onto the torus in the obvious manner. A sweep over the mesh was then 
performed using the following communication mechanisms. 

(i) Get: The standard CM-2 general router is called four times, once for each of the four 
off-processor data elements needed by each processor. 

(ii) Compiled get: Communications compiled using the communications compiler; the com- 
munications compiler preprocessor was called four times, once for each of the four off- 
processor data elements required by each processor. The data delivery procedure is 
called four times during each mesh sweep. 

(iii) Compiled gather: Communications compiled using the communications compiler; a
single call to the communications compiler preprocessor handles each processor’s four
data requests. For each iteration, a single data delivery function call carries out all
communication.

(iv) NEWS: CM-2 communications procedures that transmit information using a mesh embedded
into the hypercube by a binary reflected Gray code. NEWS was only used to benchmark
the completely uniform mesh (q = 0).
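
The Gray code embedding used by the NEWS mechanism can be illustrated as follows. The sketch treats each of the 8192 processors as a node of a 13-dimensional hypercube, with 6 address bits for the row and 7 for the column; this is a simplification that ignores how processors are grouped in hardware, and the function names are hypothetical. The property it demonstrates is standard: under a binary reflected Gray code, neighbors on the torus differ in exactly one address bit, so they are directly linked in the hypercube.

```python
ROW_BITS, COL_BITS = 6, 7  # 2**6 = 64 rows, 2**7 = 128 columns


def gray(i):
    """Binary reflected Gray code of the non-negative integer i."""
    return i ^ (i >> 1)


def hypercube_address(r, c):
    """Concatenate the Gray codes of the row and column indices."""
    return (gray(r) << COL_BITS) | gray(c)


def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def torus_neighbors(r, c, rows=1 << ROW_BITS, cols=1 << COL_BITS):
    """The four nearest neighbors of (r, c) with wraparound."""
    return [((r - 1) % rows, c), ((r + 1) % rows, c),
            (r, (c - 1) % cols), (r, (c + 1) % cols)]
```

For example, `hamming(hypercube_address(0, 0), hypercube_address(63, 0))` is 1, because the Gray code sequence is cyclic when its length is a power of two.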

The construction of the communication schedule took anywhere from 1 to 13 seconds. 

The results of these benchmarks are depicted in Figure 8. In these experiments we
carried out sweeps over meshes generated by varying q from 0.0 to 0.5. For the uniform mesh
(q = 0), we used all four communications mechanisms described above. For the synthetically
generated irregular meshes, we used the standard CM get, the compiled get and the compiled
gather. Let T_NEWS represent the time required by the CM-2 to sweep over a regular
mesh (q = 0) using the NEWS mechanism; T_NEWS was equal to 0.80 milliseconds. In Figure
8, we compare T_NEWS with the time taken by the CM-2 to sweep over irregular meshes
using the standard CM get (T_get), the compiled get (T_cget) and the compiled gather
(T_cgather). For the regular mesh, T_get, T_cget and T_cgather were 15.4, 2.2 and 1.1
times larger than T_NEWS, respectively. As q increased, the performance of the mesh sweep
degraded significantly with all three routing mechanisms tested. For q = 0.5, T_get, T_cget
and T_cgather were 22.6, 4.4 and 2.7 times larger than T_NEWS, respectively. It is clear that
runtime compilation techniques can play an important role in reducing communications
costs for irregular problems on SIMD machines. The computational cost of the simulated
annealing based communications compiler is, however, extremely high.
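
To put the reported factors and the schedule construction cost side by side, the following sketch converts the factors into per-sweep times and estimates how many sweeps are needed before a compiled schedule pays for itself. It reads "k times larger" as a multiplicative factor k, and it pairs a construction time from the reported 1 to 13 second range with a particular q and mechanism; the text does not give that pairing, so both are assumptions. The function names are hypothetical.

```python
import math

T_NEWS_MS = 0.80  # reported sweep time with NEWS on the regular mesh

# Reported ratios to T_NEWS for q = 0 and q = 0.5.
FACTORS = {
    0.0: {"get": 15.4, "cget": 2.2, "cgather": 1.1},
    0.5: {"get": 22.6, "cget": 4.4, "cgather": 2.7},
}


def sweep_time_ms(q, mechanism):
    """Per-sweep time in milliseconds implied by the reported factor."""
    return FACTORS[q][mechanism] * T_NEWS_MS


def break_even_sweeps(q, mechanism, build_seconds):
    """Sweeps needed for the per-sweep saving over the standard get
    to repay a schedule construction time of build_seconds (assumed)."""
    saved_ms = sweep_time_ms(q, "get") - sweep_time_ms(q, mechanism)
    return math.ceil(build_seconds * 1000.0 / saved_ms)
```

Under these assumptions, at q = 0.5 the compiled gather saves about 15.9 milliseconds per sweep over the standard get, so a schedule that took 13 seconds to build would be repaid after roughly 800 sweeps, and one that took 1 second after roughly 60.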


6 Conclusions 

Execution time preprocessing plays a major role in many efficient algorithms in computer 
science. Runtime preprocessing also plays an important role in exploiting multiprocessor 
architectures. Examples of such preprocessing include runtime parallelization, runtime ag- 
gregation and scheduling of remote distributed array accesses and execution time data and 
workload partitioning. We have given examples of how optimizations of this type can be 
integrated into compilers. We have also presented specific benchmarks that document, on a 
range of multiprocessor architectures, the importance of various types of runtime compila- 
tion. 